Introduction

This project examines the associations between specific age groups and different crimes within the neighborhoods of Toronto. Two datasets are used. The first is the “Toronto Neighbourhoods_shp” file, a shapefile that R reads as an object of class ‘sf’. It has 140 rows and 44 columns. Each row corresponds to one neighborhood within the city of Toronto, and the columns hold variables describing that neighborhood: total area, total population, population of different age groups, languages spoken, and geographical coordinates. The second dataset (“neighborhood-crime-rates - 4326-crime-rates - 4326.shp”) was downloaded from the City of Toronto website. It is also a shapefile of class ‘sf’, with 158 rows and 185 columns. Its rows represent neighborhoods within the city, and its variables give the number of crimes and the crime rates in each neighborhood for the years 2014 to 2023.

We will look at the proportion of four age groups in each neighborhood: people aged 0-19 (“Children and Teens”), 20-39 (“Young Adults”), 40-59 (“Middle-Aged Adults”), and 60 and above (“Seniors”). According to Statistics Canada, the 2021 Census showed that downtown Toronto is a hotspot for millennials. It also reported that people between the ages of 15 and 64 account for 81.2% of downtown Toronto’s population, which suggests that relatively few seniors (i.e., people aged 60 and above) live in the downtown core. In addition, there was a smaller proportion of children in the downtown core, and seniors showed pockets of concentration across Toronto in neighborhoods such as Rosedale-Moore Park and areas of northern Scarborough.

We will also use the crime dataset to fit spatial models that examine the association between each age group and neighborhood crime rates. The response variables we will focus on are the 2023 assault rate, the 2023 homicide rate, and the 2023 auto theft rate. Finally, we will construct more complex spatial models using additional predictors from the “Toronto Neighbourhoods_shp” file, such as ‘Population of Males’ and ‘Population of Females’.

Research Question

Can we discover significant spatial trends in the distribution of different age groups across Toronto’s neighborhoods and ultimately confirm the findings from the 2021 census? Additionally, can we identify any linear relationships between the different age groups and crime rates?

Hypothesis

Given the report of the 2021 census, we expect the following results in our analysis:

- significant spatial autocorrelation for the Young Adults age group within the downtown neighborhoods, since this area is a ‘hotspot’ for millennials.
- significant spatial autocorrelation for the Seniors age group within the downtown neighborhoods (a cluster of low proportions), since 81.2% of the downtown population consists of people between the ages of 15 and 64.
- significant spatial autocorrelation for the Seniors age group in the northern Scarborough neighborhoods.

Note that we also expect to see spatial trends that were not highlighted in the 2021 census, since census summaries and visualization alone are not the most reliable methods for discovering spatial trends.

Methods

To begin, we will merge the two datasets using ‘st_intersects’ and then create four subpopulations according to age: Children and Teens (ages 0-19), Young Adults (ages 20-39), Middle-Aged Adults (ages 40-59), and Seniors (ages 60 and above). Each subpopulation will be expressed as a proportion of the neighborhood’s total population, so the variables added to our dataset hold the proportion of each age group in each neighborhood. Note that the merged dataset will have exactly 140 rows (i.e., 140 neighborhoods), since the ‘Toronto Neighborhoods’ shapefile has fewer rows than the ‘Neighborhood crime rates’ file.
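The proportion step can be illustrated with a short, language-neutral sketch (written here in Python, although the analysis itself refers to R’s ‘sf’ tooling). The function name, the dictionary layout, and the toy counts are all hypothetical; the actual age columns of the shapefile are not listed in the text.

```python
# Age bands as defined in the study; upper bound None means "and above".
AGE_GROUPS = {
    "Children_Teens": (0, 19),
    "Young_Adults": (20, 39),
    "Middle_aged": (40, 59),
    "Seniors": (60, None),
}


def age_group_proportions(counts_by_age):
    """Return each age group's share of one neighborhood's population.

    counts_by_age maps an age in years to a resident count. This input
    layout is an assumption made for illustration only.
    """
    total = sum(counts_by_age.values())
    if total == 0:
        raise ValueError("neighborhood has no residents")
    proportions = {}
    for name, (low, high) in AGE_GROUPS.items():
        group_total = sum(
            count
            for age, count in counts_by_age.items()
            if age >= low and (high is None or age <= high)
        )
        proportions[name] = group_total / total
    return proportions


# Toy neighborhood with 1000 residents (made-up numbers).
example_counts = {10: 200, 25: 300, 45: 300, 70: 200}
example_props = age_group_proportions(example_counts)
```

Because every resident falls into exactly one band, the four proportions for a neighborhood sum to one, which is worth keeping in mind when all four are later used together as predictors.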

Since we are working with areal data, we need to represent the proximity between areal units (i.e., neighborhoods) with an adjacency matrix (proximity matrix). We will use both border-based and distance-based methods to do this; agreement across more than one method would indicate that our results are consistent. For the border-based method we will use “Queen” connectivity, and for the distance-based method we will use K-Nearest Neighbors (kNN) with k = 4, 5, and 6. We chose these values of k so that each neighborhood has between four and six neighbors, which we believe is a suitable range for detecting spatial trends. From here, we will construct a weight matrix for each method, with row-standardized weights. For all four neighbor definitions, we will conduct a Moran’s I test for each age group, which gives 16 tests in total, four per group. Moran’s I is a test statistic defined as:

\[ I = \frac{1}{s^2} \frac{\sum_{i}\sum_{j}w_{ij}(y_{i}-\overline{y})(y_{j}-\overline{y})}{\sum_{i}\sum_{j}w_{ij}} \]

Here \(y_{i}\) is the value observed in neighborhood \(i\), \(\overline{y}\) is the mean over all neighborhoods, \(w_{ij}\) is the spatial weight between neighborhoods \(i\) and \(j\), and \(s^2\) is the variance of \(y\). We conclude that there is significant spatial autocorrelation when \(p < 0.05\); in that case \(I > 0\) indicates clustering of similar values and \(I < 0\) indicates dispersion. After comparing the Moran’s I results across the four neighbor definitions, we will choose the most suitable one (i.e., the one with the lowest p-value). Furthermore, we will use the Monte Carlo approach and correlograms of the Moran’s I statistic to help determine which weight matrix is best to use. Then, using our chosen matrix, we will quantify local spatial autocorrelation between Toronto neighborhoods using the Local Moran’s I and the Getis-Ord G* test. Both of these tests are Local Indicators of Spatial Association (LISA), which means that they indicate the extent of spatial clustering around a single areal unit rather than across the whole region. Similar to the global Moran’s I, the Local Moran’s I shows significant local spatial autocorrelation when \(p < 0.05\) with either \(I > 0\) or \(I < 0\), and the Getis-Ord G* produces a z-score denoted \(G_{i}^{*}\). A group of areal units with high \(G_{i}^{*}\) \((p<0.05)\) indicates a ‘hotspot’, whereas a low \(G_{i}^{*}\) \((p < 0.05)\) indicates a ‘coldspot’.
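As a rough illustration of the global statistic and the Monte Carlo approach, the sketch below computes Moran’s I from a row-standardized weight matrix and estimates a one-sided p-value by randomly permuting the observed values. It is a minimal standard-library Python sketch, not the procedure used in the study; the function names, the six-area toy layout, and the values are assumptions, and \(s^2\) is taken as the variance with divisor \(n\).

```python
import random


def row_standardize(adjacency):
    """Divide each row of a binary adjacency matrix by its row sum."""
    weights = []
    for row in adjacency:
        row_sum = sum(row)
        weights.append([v / row_sum if row_sum else 0.0 for v in row])
    return weights


def morans_i(y, w):
    """Global Moran's I: (1 / s^2) * sum_ij w_ij z_i z_j / sum_ij w_ij."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]
    s2 = sum(v * v for v in z) / n  # variance with divisor n (assumption)
    cross = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    total_weight = sum(sum(row) for row in w)
    return cross / (s2 * total_weight)


def permutation_p_value(y, w, n_sim=999, seed=1):
    """One-sided Monte Carlo p-value for positive spatial autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(y, w)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if morans_i(shuffled, w) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_sim + 1)


# Toy layout: six areas in a row, each adjacent to its immediate neighbors.
toy_adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
toy_weights = row_standardize(toy_adjacency)
toy_values = [0.10, 0.12, 0.15, 0.30, 0.33, 0.35]  # made-up proportions
i_observed, p_simulated = permutation_p_value(toy_values, toy_weights)
```

In the toy layout, low values sit next to low values and high next to high, so the statistic is positive; a perfectly alternating pattern on the same layout would give a negative value instead.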

For the next part of our study, we aim to model the rates of different crimes in Toronto using the four age groups as linear predictors. For each crime rate (assault, auto theft, and homicide), we will first scale the response variable, fit a linear regression model with the four age groups as predictors, and then use backward selection to find the most suitable model. We will then conduct a Moran’s I test on the residuals to see whether there is any spatial dependence between the error terms. Next, we need to account for the spatial dependence between neighborhoods by including autoregressive terms in our linear model, which motivates simultaneous autoregressive (SAR) and conditional autoregressive (CAR) models. There are three types of SAR models: SAR error (spatial dependence in the error terms only), SAR lag (spatial dependence in the spatially lagged response only), and SAR mixed (spatial dependence in both lag and error), which we refer to as the lag-error model. A CAR model, in contrast, defines spatial dependence by specifying a Gaussian conditional distribution for each observation given its neighbors. The last model type we will construct is a linear mixed-effects model (LMM) with a random intercept term for each neighborhood; this model accounts for spatial dependence by specifying a Matérn correlation structure in the error terms.

Using the ‘optimal’ predictors from the linear regression model, we will construct the five models described above and compare them with the likelihood ratio test. We will also compute the Moran’s I test statistic for the residuals of each model to examine any remaining spatial dependence. In the final part of our study, we will add two new predictors to our models: ‘Pop_Males’ and ‘Pop_Female’. We will scale both variables before including them and then repeat the steps outlined above. The final models will again be compared using the likelihood ratio test.
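The likelihood ratio comparison can be sketched as follows, assuming each spatial model is compared against the non-spatial linear model: the statistic is twice the gain in log-likelihood, referred to a chi-square distribution whose degrees of freedom equal the number of added spatial parameters (one for a lag or error model, two for a lag-error model). This is a minimal standard-library Python sketch; the function names and the log-likelihood values are hypothetical and are not taken from the study.

```python
import math


def lr_statistic(loglik_spatial, loglik_linear):
    """Likelihood ratio statistic: 2 * (logLik spatial - logLik linear)."""
    return 2.0 * (loglik_spatial - loglik_linear)


def chi_square_upper_tail(x, df):
    """P(X > x) for a chi-square variable, using closed forms for df 1 or 2."""
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 is covered in this sketch")


# Hypothetical log-likelihoods for a linear model and a spatial lag model.
example_lr = lr_statistic(loglik_spatial=-175.0, loglik_linear=-180.0)
example_p = chi_square_upper_tail(example_lr, df=1)
```

Under this reading, a statistic above roughly 3.84 (one added parameter) or 5.99 (two added parameters) corresponds to \(p < 0.05\).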

Results

Summary Statistics

Table 1: Summary statistics for each age group

| Statistic | Children_Teens | Young_Adults | Middle_aged | Seniors |
|---|---|---|---|---|
| Min. | 0.0684424 | 0.1738382 | 0.1904885 | 0.0788095 |
| 1st Qu. | 0.1882813 | 0.2515311 | 0.2783513 | 0.1695733 |
| Median | 0.2125895 | 0.2798399 | 0.2959825 | 0.1972979 |
| Mean | 0.2107367 | 0.2943919 | 0.2934439 | 0.2013889 |
| 3rd Qu. | 0.2432896 | 0.3144555 | 0.3120772 | 0.2279440 |
| Max. | 0.3217165 | 0.6107143 | 0.3470546 | 0.3246692 |

**Figure 1: Comparing population distribution of age groups across 140 Toronto Neighbourhoods**

**Figure 2: Histograms for the proportions of each age group**

Table 1 shows that ‘Young Adults’ and ‘Middle-Aged Adults’ have the two highest average proportions across Toronto neighborhoods (0.294 and 0.293 respectively) among the four categories. ‘Seniors’ have the lowest average proportion at 0.201, while ‘Children and Teens’ have an average of 0.211. The boxplots in Figure 1 give a visual representation of the summary statistics in Table 1, where the differences in median (black line) between the groups are easily discernible. We also see several outliers in the ‘Young Adults’ and ‘Middle-Aged Adults’ boxplots, which indicates that several neighborhoods have an unusually high proportion of young adults and several others have an unusually low proportion of middle-aged adults. Figure 2 shows the distribution of neighborhood proportions for each age group. The ‘Children and Teens’ histogram is wide and unimodal, with a peak in the 0.2-0.25 interval. The ‘Young Adults’ histogram is narrow, unimodal, and right-skewed, with a peak in the 0.25-0.3 interval. The ‘Middle-Aged Adults’ histogram is left-skewed and wide, with a single peak between 0.25 and 0.35. Finally, the ‘Seniors’ histogram is relatively wide and almost resembles a uniform distribution between 0.15 and 0.25; its largest peak is between 0.175 and 0.2.

Maps 1-4 give insight into potential neighborhood clusters and patterns for each age group. Map 1 shows that the majority of downtown neighborhoods fall within the 0-25th percentile for the proportion of Children and Teens. Map 2 shows that the downtown neighborhoods have a higher proportion of ‘Young Adults’, while Map 3 shows an even distribution of ‘Middle-Aged Adults’ across Toronto. Finally, Map 4 indicates that the downtown core neighborhoods fall within the 0-25th percentile for the Senior population.

Computing the Global Moran’s I with different border based methods

Table 2: Moran’s I test results for Children/Teens

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.5904388 | 0 |
| KNN N=4 | 0.5887324 | 0 |
| KNN N=5 | 0.5401040 | 0 |
| KNN N=6 | 0.5254209 | 0 |

**Figure 3: Correlograms for the Moran’s I test: Children/Teens**

Table 3: Moran’s I test results for Young Adults

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.6112999 | 0 |
| KNN N=4 | 0.6054979 | 0 |
| KNN N=5 | 0.5724384 | 0 |
| KNN N=6 | 0.5465047 | 0 |

**Figure 4: Correlograms for the Moran's I test: Young Adults**

Table 4: Moran’s I test results for Middle-Aged Adults

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.4928532 | 0 |
| KNN N=4 | 0.4679394 | 0 |
| KNN N=5 | 0.4380185 | 0 |
| KNN N=6 | 0.4151052 | 0 |

**Figure 5: Correlograms for the Moran's I test: Middle-Aged Adults**

Table 5: Moran’s I test results for Seniors

| Method | Moran’s I | p-value |
|---|---|---|
| Queen | 0.2082105 | 1.04e-05 |
| KNN N=4 | 0.2657183 | 1.00e-06 |
| KNN N=5 | 0.2534780 | 2.00e-07 |
| KNN N=6 | 0.2368294 | 1.00e-07 |

**Figure 6: Correlograms for the Moran's I test: Seniors**

**Figure 7: Monte Carlo simulations for the Moran's I tests in each age group with Queen connectivity**

In all of the Moran’s I tests, \(p < 0.05\) and the Moran’s I test statistic was greater than 0. For the ‘Children and Teens’, ‘Young Adults’, and ‘Middle-Aged Adults’ age groups, Queen connectivity gave the highest test statistic, and the p-values were reported as 0 for every method. In contrast, for ‘Seniors’ the highest test statistic came from the KNN N=4 method, while the lowest p-value came from KNN N=6. The Monte Carlo simulations in Figure 7 show exactly the same test statistic (red line) as the ‘Queen’ Moran’s I tests in Tables 2, 3, 4, and 5. Finally, the correlograms in Figures 3, 4, 5, and 6 show the fewest lags for the Queen border-based method in all four age groups (approximately 2-3 lags). Consequently, we used the ‘Queen’ neighborhood definition for the rest of our analysis, because it had the highest Moran’s I test statistic in three of the four age groups and was thus the most suitable for identifying significant spatial clusters.

Local Moran’s I and Getis-Ord G* maps

Maps 5, 7, 9, and 11 visualize the Local Moran’s I tests for each age group. Neighborhoods outlined in green are those where the Local Moran’s I test was significant \((p < 0.05)\) at the \(\alpha = 0.05\) significance level, with a statistic either greater than or less than 0. Maps 6, 8, 10, and 12 visualize the Getis-Ord G* tests for each neighborhood. Neighborhoods outlined in green are again statistically significant \((p<0.05)\) at the \(\alpha = 0.05\) significance level.
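To make the quantity behind the Local Moran’s I maps concrete, the sketch below computes a local statistic for each area as \(I_{i} = (z_{i}/m_{2})\sum_{j}w_{ij}z_{j}\), where \(z_{i}\) is the deviation from the mean and \(m_{2}\) is the variance with divisor \(n\), and labels each area by the sign of its own deviation and of its neighbors’ weighted deviation. It is a minimal standard-library Python sketch with hypothetical function names and toy values; it omits the significance test that determines which neighborhoods are outlined on the maps.

```python
def row_standardize(adjacency):
    """Divide each row of a binary adjacency matrix by its row sum."""
    weights = []
    for row in adjacency:
        row_sum = sum(row)
        weights.append([v / row_sum if row_sum else 0.0 for v in row])
    return weights


def local_morans_i(y, w):
    """Return (local statistics, cluster labels) for each areal unit."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]
    m2 = sum(v * v for v in z) / n
    # Spatial lag: weighted average deviation of each unit's neighbors.
    lag = [sum(w[i][j] * z[j] for j in range(n)) for i in range(n)]
    local_i = [z[i] * lag[i] / m2 for i in range(n)]
    labels = []
    for own, neighbors in zip(z, lag):
        if own >= 0 and neighbors >= 0:
            labels.append("high-high")
        elif own < 0 and neighbors < 0:
            labels.append("low-low")
        elif own >= 0:
            labels.append("high-low")
        else:
            labels.append("low-high")
    return local_i, labels


# Toy layout: six areas in a row, each adjacent to its immediate neighbors.
line_adjacency = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
line_weights = row_standardize(line_adjacency)
line_values = [0.10, 0.12, 0.15, 0.30, 0.33, 0.35]  # made-up proportions
local_stats, cluster_labels = local_morans_i(line_values, line_weights)
```

With row-standardized weights and no isolated units, the average of the local statistics equals the global Moran’s I for the same data, which is one way to check that the two levels of analysis agree.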

Predicting Assault Rate

Table 6: Results for the Likelihood Ratio test for the SAR, CAR, and LMM models - Assault ~ Children + Seniors

| Model | LR test statistic | p-value (LR test) | p-value (Moran’s I of residuals) |
|---|---|---|---|
| Error Only | 53.2150 | <0.05 | 0.3937 |
| Lag Only | 57.1140 | <0.05 | 0.3773 |
| Lag-Error Model | 57.2170 | <0.05 | 0.4442 |
| CAR | 44.4820 | <0.05 | 0.8616 |
| LMM | 44.4814 | <0.05 | 0.8466 |

When we regressed the assault rate (the response variable) on all four predictors, the resulting linear regression model showed that the ‘Middle-Aged Adults’ and ‘Young Adults’ covariates were not statistically significant \((p > 0.05)\) at the \(\alpha = 0.05\) significance level. This model had an adjusted \(R^{2} = 0.2239\). Using backward selection, we first removed the covariate with the highest p-value (‘Young Adults’) and fit the model again. In this model, the coefficient for ‘Middle-Aged Adults’ was again not statistically significant and adjusted \(R^{2} = 0.2258\); thus, we removed this covariate and fit the model with the remaining two variables. In this final model all coefficients were statistically significant \((p < 0.05)\) at the \(\alpha = 0.05\) significance level, with adjusted \(R^{2} = 0.2314\). Table 6 shows that the SAR lag-error model had the highest likelihood ratio test statistic and the CAR model had the highest p-value for the Moran’s I test of its residuals. The coefficients for the SAR lag-error model were -3.98770 and -3.20618 for the Children/Teens and Seniors variables respectively. Note also that in all of the Moran’s I tests of the residuals, the p-value was greater than 0.05.

Predicting Autotheft Rate

**Figure 9: Comparing the residuals between the SAR and CAR models**

Table 7: Results for the Likelihood Ratio test for the SAR, CAR, and LMM models - Autotheft ~ Middle Aged + Young Adults

| Model | LR test statistic | p-value (LR test) | p-value (Moran’s I of residuals) |
|---|---|---|---|
| Error Only | 11.51800 | <0.05 | 0.5225 |
| Lag Only | 15.03600 | <0.05 | 0.6497 |
| Lag-Error Model | 16.71500 | <0.05 | 0.4700 |
| CAR | 5.56610 | <0.05 | 0.9791 |
| LMM | 5.56614 | <0.05 | 0.9185 |

When we regressed the auto theft rate on all four predictors, the resulting linear regression model showed that the ‘Children/Teens’ and ‘Seniors’ covariates were not statistically significant \((p > 0.05)\) at the \(\alpha = 0.05\) significance level. This model had an adjusted \(R^{2} = 0.1685\). Using backward selection, we first removed the covariate with the highest p-value (‘Seniors’) and fit the model again. In this model, the coefficient for ‘Children/Teens’ was again not statistically significant and adjusted \(R^{2} = 0.1746\); thus, we removed this covariate and fit the model with the remaining two variables. In this final model all coefficients were statistically significant \((p < 0.05)\) at the \(\alpha = 0.05\) significance level, with adjusted \(R^{2} = 0.1737\). Table 7 shows that the SAR lag-error model had the highest likelihood ratio test statistic and the CAR model had the highest p-value for the Moran’s I test of its residuals. The coefficients for the SAR lag-error model were -3.3846 and -4.7656 for the Young Adults and Middle-Aged Adults variables respectively. Note also that in all of the Moran’s I tests of the residuals, the p-value was greater than 0.05.

Predicting Homicide Rate

**Figure 10: Comparing the residuals between the SAR and CAR models**

Table 8: Comparing the results for the Likelihood Ratio test for the SAR, CAR, and LMM models - Homicide ~ Children + Seniors

| Model | LR test statistic | p-value (LR test) | p-value (Moran’s I of residuals) |
|---|---|---|---|
| Error Only | 3.649600 | <0.05 | 0.4112 |
| Lag Only | 6.236100 | <0.05 | 0.6251 |
| Lag-Error Model | 7.422500 | <0.05 | 0.4953 |
| CAR | 1.431900 | <0.05 | 0.7833 |
| LMM | 2.738244 | <0.05 | 0.6542 |

When we regressed the homicide rate (the response variable) on all four predictors, the resulting linear regression model showed that the ‘Middle-Aged Adults’ and ‘Young Adults’ covariates were not statistically significant \((p > 0.05)\) at the \(\alpha = 0.05\) significance level. This model had an adjusted \(R^{2} = 0.05172\). Using backward selection, we first removed the covariate with the highest p-value (‘Middle-Aged Adults’) and fit the model again. In this model, the coefficient for ‘Young Adults’ was again not statistically significant and adjusted \(R^{2} = 0.05607\); thus, we removed this covariate and fit the model with the remaining two variables. In this final model all coefficients were statistically significant \((p < 0.05)\) at the \(\alpha = 0.05\) significance level, with adjusted \(R^{2} = 0.05484\). Table 8 shows that the SAR lag-error model had the highest likelihood ratio test statistic and the CAR model had the highest p-value for the Moran’s I test of its residuals. The coefficients for the SAR lag-error model were 2.7666 and 5.3854 for the Children/Teens and Seniors variables respectively. Note also that in all of the Moran’s I tests of the residuals, the p-value was greater than 0.05.

Finally, let’s examine crime rates using other independent variables in the dataset

Table 9: Comparing the results for the Likelihood Ratio test for the Assault rate model with more predictors - Assault ~ Children + Males + Females

| Model | LR test statistic | p-value (LR test) | p-value (Moran’s I of residuals) |
|---|---|---|---|
| Error Only | 34.97500 | <0.05 | 0.3686 |
| Lag Only | 44.10000 | <0.05 | 0.5344 |
| Lag-Error Model | 44.25100 | <0.05 | 0.4558 |
| CAR | 27.90700 | <0.05 | 0.8989 |
| LMM | 30.58911 | <0.05 | 0.8954 |

Table 10: Comparing the results for the Likelihood Ratio test for the Autotheft rate model with more predictors - Autotheft ~ Young Adults + Middle Aged Adults + Males + Females

| Model | LR test statistic | p-value (LR test) | p-value (Moran’s I of residuals) |
|---|---|---|---|
| Error Only | 9.415300 | <0.05 | 0.5206 |
| Lag Only | 14.068000 | <0.05 | 0.7100 |
| Lag-Error Model | 16.409000 | <0.05 | 0.4831 |
| CAR | 3.851200 | <0.05 | 0.9617 |
| LMM | 3.851152 | <0.05 | 0.8661 |

Table 11: Comparing the results for the Likelihood Ratio test for the Homicide rate model with more predictors - Homicide ~ Males + Females

| Model | LR test statistic | p-value (LR test) | p-value (Moran’s I of residuals) |
|---|---|---|---|
| Error Only | 2.537500 | <0.05 | 0.4130 |
| Lag Only | 5.424000 | <0.05 | 0.6480 |
| Lag-Error Model | 6.556200 | <0.05 | 0.4902 |
| CAR | 1.114500 | <0.05 | 0.7325 |
| LMM | 1.114479 | <0.05 | 0.5994 |

When we regressed the assault rate on the ‘Children_Teens’, ‘Seniors’, ‘Pop_Males’, and ‘Pop_Female’ variables, we again used backward selection and found that the most suitable model (where all coefficients are significant: \(p<0.05\)) consists of the ‘Children_Teens’, ‘Pop_Female’, and ‘Pop_Males’ variables. The Moran’s I test of the residuals for this model was significant (\(p<0.05\)) at the \(\alpha = 0.05\) significance level. Table 9 compares the likelihood ratio test results for the SAR, CAR, and LMM models. As in our results above, the SAR lag-error model had the highest likelihood ratio test statistic and the CAR model had the highest p-value for the Moran’s I test of the residuals. Next, regressing the auto theft rate on our original predictors along with the two new predictors produced a model with ‘Young Adults’, ‘Middle-Aged Adults’, ‘Pop_Males’, and ‘Pop_Female’ (Table 10). Backward selection did not produce a better-fitting model, so we kept these four variables. Again, the SAR lag-error model had the highest likelihood ratio test statistic and the CAR model had the highest p-value among the Moran’s I tests of the residuals. Finally, the most suitable predictors for the homicide rate were just the ‘Pop_Males’ and ‘Pop_Female’ covariates (Table 11). In this model, too, the SAR lag-error model had the highest likelihood ratio test statistic and the CAR model had the highest p-value among the Moran’s I tests of the residuals.

Discussion

Our findings from the first part of our study confirm some of the claims from the 2021 census. The Local Moran’s I and Getis-Ord G* maps showed a cluster of neighborhoods with higher proportions of young adults in the downtown core and a cluster of neighborhoods with lower proportions of seniors in the downtown core. However, our maps also show the following: a cluster of neighborhoods with low proportions of children and teens in the downtown core, clusters of neighborhoods in north Etobicoke and Scarborough with high proportions of children, a cluster of neighborhoods with low proportions of young adults in west Etobicoke, a cluster of neighborhoods with low proportions of middle-aged adults in the downtown core, a cluster of neighborhoods with high proportions of middle-aged adults in East York, and a cluster of neighborhoods with high proportions of seniors in west Etobicoke.

In the second part of our study we aimed to predict the rates of different crimes using the four age groups as predictors. For the assault rate, the SAR lag-error model had the highest likelihood ratio test statistic and was thus the most suitable model. A Moran’s I test on the residuals of the linear regression indicated that the error terms were indeed spatially correlated; thus, a SAR model that accounted for spatial dependence in both the lag and the error was the best fit. Across all the crimes, the SAR lag-error model always had the highest likelihood ratio test statistic and was thus always the most suitable model. Even when we added the two new variables ‘Pop_Males’ and ‘Pop_Female’ to our models, the SAR lag-error model continued to have the highest likelihood ratio test statistic. The results from our spatial models also indicated that higher proportions of children and seniors were associated with a lower assault rate but a higher homicide rate, and that higher proportions of middle-aged adults and young adults were associated with a lower auto theft rate. We believe that these results have some validity; however, the living trends of different age groups need further study.

Nonetheless, our research relied on some strong assumptions and questionable methods. First, we assumed that the response and independent variables are linearly related, which may not be the case. We also ignored the possibility of multicollinearity, which is a serious issue when considering the gender distribution in certain age groups. For example, if the Senior population had a substantially higher proportion of females than males, the coefficient estimates could be misleading. We also used backward selection to find our most suitable model; however, research has shown that this method is prone to misspecification.

Overall, the Local Moran’s I and Getis-Ord G* maps accurately depict the spatial trends of specific age groups within the city, yet the relationship between age groups and different crime rates calls for more research.